ABSTRACT

Stochastic models are developed in this article to examine
the rate of test misgrading in educational and psychological measurement. The
estimation of inadvertent grading errors can serve as a basis for quality 
control in measurement. Limitations of traditional Poisson models have been 
reviewed to highlight the need to introduce new models using well established 
geometric and negative binomial distributions. Equations are developed for 
the use of geometric and negative binomial distributions in the study of test 
misgrading. In this study, the geometric process is developed from a 
single-grader scenario under a policy of zero tolerance for test misgrading. 
The negative binomial process seems appropriate for state or national
assessments that involve more than one test grader. Results of this
investigation can be used to keep the number of misgraded events below a
threshold k. Features of the quality control measures are discussed in the
context of local and national assessments. (Contains 22
Stochastic Models of Quality Control on Test Misgrading



Abstract 

Stochastic models are developed in this article to examine the rate of test
misgrading in educational and psychological measurement. Limitations of traditional
Poisson models are reviewed to highlight the need to introduce new models
using the well-established geometric and negative binomial distributions. Results of this
investigation can be employed to keep the number of misgraded events below a
threshold k. Features of the quality control measures are discussed in the
context of local and national assessments.

In the last decade, essay items have been incorporated in major educational
assessments, such as the National Assessment of Educational Progress (NAEP) and the
Third International Mathematics and Science Study (TIMSS) (Allen, Carlson, & Zelenak,
1999; Martin & Kelly, 1996). Meanwhile, classroom teachers are urged to use essay
questions to complement multiple-choice items. The varied responses generated by essay
items demand a large amount of manpower in test grading. Although no grader intends to
make mistakes, accidental errors are likely to occur in any human operation (Wang,
1993). The purpose of this study is to examine the chance of test misgrading using
appropriate statistical models. The estimation of inadvertent grading errors can serve as
a basis for quality control in educational and psychological measurement.

Literature Review 

Statistical models have been sought to enhance quality control in various projects.
In industrial statistics, quality control measures are adopted mainly to keep the total
number of inferior incidents below a threshold k. Bissell (1970) observed:

Incident counts form an important class of data, arising particularly in 
manufacturing processes and accident studies. ... It is often assumed that such 
events follow the Poisson Law. The assumptions of constant mean level and 
independence are often violated in practice. (p. 215)

In educational and psychological measurement, test misgrading can be treated as
a specific type of incident. In a classroom setting, Lyman (1998) noted that “Every
teacher recognizes that grades are somewhat arbitrary and subjective” (p. 107). In a
large-scale assessment, it is even more difficult to assume the same level of average
performance among various graders. Accordingly, the assumption of a constant mean
performance level is often violated in small- and large-scale assessments, which makes
the Poisson model unsuitable for most real-life applications (Rasch, 1980; Wang, 1993).
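
To illustrate why unequal grader performance strains the Poisson assumption, the following minimal sketch mixes Poisson error counts from graders with different mean rates. The rates are hypothetical values chosen for illustration, not data from this study, and the function name is an assumption. For such a mixture the mean equals the average rate, while the variance equals the average rate plus the variance of the rates, so the variance exceeds the mean whenever graders differ.

```python
from statistics import mean, pvariance


def mixture_mean_and_variance(rates):
    """Mean and variance of the error count of a randomly chosen grader,
    where each grader's count is Poisson with that grader's own rate."""
    m = mean(rates)
    # Law of total variance: E[Var(count | rate)] + Var(E[count | rate])
    # For a Poisson count both conditional moments equal the rate.
    v = m + pvariance(rates)
    return m, v


# Hypothetical misgrading rates (errors per batch) for three graders
rates = [1.0, 2.0, 6.0]
m, v = mixture_mean_and_variance(rates)
print(f"mean = {m:.3f}, variance = {v:.3f}")

# With identical graders the mixture is again Poisson (variance equals mean)
m_equal, v_equal = mixture_mean_and_variance([3.0, 3.0, 3.0])
print(f"mean = {m_equal:.3f}, variance = {v_equal:.3f}")
```

A single Poisson distribution forces the variance to equal the mean, so the excess variance in the first case is what a Poisson model cannot represent.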

Whenever the assumption of Poisson distribution does not hold, statisticians tend 
to adopt alternative models to strengthen the quality control process. In particular, 
Johnson and Kotz (1969) pointed out, “The negative binomial distribution is very often a 
first choice as alternative when it is felt that a Poisson distribution might be inadequate” 
(p. 125). Edwards and Gurland (1961) compared a class of distributions applicable to
accidents, and reported that “the negative binomial gives an appreciably better fit than the 
Poisson distribution” (p. 504). Nonetheless, the negative binomial model has yet to be 
adopted in education to analyze test misgrading (Rasch, 1980; Wang, 1993). 

In contrast, researchers in other fields have applied the negative binomial
distribution to a wide range of topics. Barnwal and Paul (1988) reviewed applications of
negative binomial models to count data, and noted that “Count data which follow the
negative binomial distribution arise in numerous areas of biostatistics (Anscombe, 1949; 
Bliss & Fisher, 1953; Bliss & Owen, 1958; McCaughran & Arnold, 1976)” (p. 215). In 
military industries, much earlier applications have been made by Greenwood and Yule 
(1920). They reported that the negative binomial distribution gave a better fit than did 
the Poisson distribution to accidents in munitions factories in England during the First 
World War. Besides the count data, Ross and Preece (1985) added that “The negative 
binomial distribution is often appropriate for data for aggregated organism; it can arise 
from various different models (Anscombe, 1950, p. 360; Bliss, 1953, p. 185ff; Boswell & 
Patil, 1970; Freeman, 1980)” (p. 323). The various applications have resulted in different 
presentations of the negative binomial distribution. Consequently, as noted by
Barnwal and Paul (1988), “Different authors have expressed the negative binomial
distribution in different forms” (p. 215), which has caused substantial confusion in its applications.

Matloff (1988) further examined connections between the negative binomial
distribution and other statistical models, and reported, “In spite of the fact the name of
this family contains the word binomial, it is related more closely to the geometric family
than to the binomial family” (p. 83). However, various alternative presentations of the
geometric distribution have also appeared in the statistical literature (e.g., Casella &
Berger, 1990, p. 74 & p. 625). To avoid distraction from these notational differences,
geometric and negative binomial models are introduced in this study to estimate
the rate of test misgrading in education. Criteria of quality control are considered
to differentiate the models in various settings.

Stochastic Models 

In a test scoring process, a contrast can be set to differentiate the outcomes of correct
grading and misgrading. An event with dichotomous outcomes is typically modeled by a
Bernoulli trial. For a well-designed test, the chance of misgrading (p) is not high.
Quality control measures, such as scheduled short breaks, can be
introduced in the grading process to ensure that the number of misgraded cases is no
larger than a specific level k. In practice, the grading process may continue until the
occurrence of the kth misgrading. At that point, a break session can be scheduled to refresh
the graders and thus help keep the number of misgradings from exceeding level k.

To facilitate description of the stochastic model, one may define X to be the total
number of successes (correctly graded cases) before the kth misgrading, and f(x) to be
the probability of obtaining exactly x successes. Accordingly, the total number of trials (X+k) depends on the
threshold level k and the number of correctly graded cases (X) before reaching the
threshold. Because misgrading happens by accident, the number of
correctly graded cases (X) may vary among graders. Given a threshold level k,
the expected value of X can be employed to schedule break sessions before reaching a
misgrading incident on the (X+k)th trial.

Geometric Stochastic Process

Under a condition of zero tolerance, one may wish to schedule a break period for
test graders before the first occurrence of misgrading. Using the symbol s to represent
successful test grading and m to represent the first misgrading, one may describe the
stochastic process by the following chain of events:

\[
\underbrace{s\,s\,\cdots\,s}_{X\ \text{times}}\; m
\]

This stochastic chain can occur in only one way, i.e., the grader has successfully graded
test questions on the first X trials, and ends up with the first misgrading case on the
(X+1)th trial. Therefore, the chance of obtaining exactly X successes before the first
failure is:

\[
f(x) = \underbrace{(1-p)(1-p)\cdots(1-p)}_{X\ \text{times}}\; p
\]

where p is the probability of misgrading in each trial.

This stochastic process can be described by a probability function for
different values of X:

\[
f(x) = (1-p)^{x}\,p; \qquad x = 0, 1, 2, \ldots \tag{1}
\]

Equation (1) defines a geometric process “because the probabilities form a geometric
series with a common ratio 1-p” (Kalbfleisch, 1979, p. 118). Since the chance of
misgrading in each trial (p) is less than 1, the total probability follows:

\[
\left[1 + (1-p) + (1-p)^{2} + \cdots\right] p = \frac{p}{1-(1-p)} = 1
\]

which confirms the feature of

\[
\sum_{x=0}^{\infty} f(x) = 1 \tag{2}
\]
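
The geometric probability function, f(x) = (1-p)^x p, and its total probability of one can be checked numerically. The sketch below is illustrative only: the function name and the value p = 0.05 are assumptions, not taken from the study, and the infinite sum is approximated by a long partial sum.

```python
def geometric_pmf(x, p):
    """Probability of exactly x correct gradings before the first misgrading."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    if x < 0:
        raise ValueError("x must be a non-negative integer")
    return (1 - p) ** x * p


p = 0.05  # hypothetical chance of misgrading on each trial

# Partial sum of the probabilities; the neglected tail equals (1 - p)**2000
partial_sum = sum(geometric_pmf(x, p) for x in range(2000))
print(f"partial sum over x = 0..1999: {partial_sum:.12f}")

# Successive probabilities shrink by the common ratio 1 - p
ratio = geometric_pmf(11, p) / geometric_pmf(10, p)
print(f"common ratio: {ratio:.4f}")
```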

While the expected value of X is an important index for setting break sessions in a
test grading process, it is not so simple to derive E(X) directly from the definition
$E(X) = \sum_{x=0}^{\infty} x\,f(x)$ (Casella & Berger, 1990). Fortunately, equation (2) provides a
shortcut for the derivation. Taking derivatives with respect to p on both sides of (2), one gets

\[
\frac{d}{dp} \sum_{x=0}^{\infty} f(x) = 0
\]

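The left-hand side can be rewritten as

\[
\begin{aligned}
\frac{d}{dp} \sum_{x=0}^{\infty} f(x)
  &= \sum_{x=0}^{\infty} \frac{d}{dp} f(x) \\
  &= \sum_{x=0}^{\infty} \frac{d}{dp}\left[(1-p)^{x} p\right] \\
  &= \sum_{x=0}^{\infty} \left[(1-p)^{x} - x (1-p)^{x-1} p\right] \\
  &= \frac{1}{p} \sum_{x=0}^{\infty} (1-p)^{x} p - \frac{1}{1-p} \sum_{x=0}^{\infty} x (1-p)^{x} p \\
  &= \frac{1}{p} - \frac{E(X)}{1-p}
\end{aligned}
\tag{3}
\]

Following equations (2) and (3), one may get

\[
\frac{1}{p} - \frac{E(X)}{1-p} = 0
\]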
Thus, the expected number of correctly graded cases is:

\[
E(X) = \frac{1-p}{p} \tag{4}
\]

Because a break session is arranged before occurrence of the first misgrading (i.e., k=1),
the waiting time for the first misgrading is:

\[
E(X + k) = E(X + 1) = E(X) + 1 = \frac{1-p}{p} + 1 = \frac{1}{p} \tag{5}
\]
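
The closed forms E(X) = (1-p)/p and E(X+1) = 1/p can be cross-checked against the definition of the expected value by summing x f(x) term by term. The sketch below assumes a hypothetical p = 0.02 and truncates the infinite series after a large number of terms; the function name is illustrative.

```python
def expected_correct_before_first_error(p, terms=100_000):
    """Approximate E(X) = sum of x * (1 - p)**x * p by truncating the series."""
    q = 1 - p
    total = 0.0
    weight = p  # holds (1 - p)**x * p for the current x
    for x in range(terms):
        total += x * weight
        weight *= q
    return total


p = 0.02  # hypothetical chance of misgrading on each trial
e_x = expected_correct_before_first_error(p)

print(f"series estimate of E(X):   {e_x:.6f}")
print(f"closed form (1 - p) / p:   {(1 - p) / p:.6f}")
print(f"waiting time E(X) + 1:     {e_x + 1:.6f}")
print(f"closed form 1 / p:         {1 / p:.6f}")
```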

Although the geometric distribution has been discussed in many books, some
books present the expected value in equation (4) (e.g., Bhat, 1984, p. 106; Casella &
Berger, 1990, p. 74; Feller, 1957, p. 210; Port, 1994, p. 247) and others give the result in
equation (5) (e.g., Matloff, 1988, p. 83; Ewart, Ford, & Lin, 1974, p. 322; Draper &
Lawrence, 1970, p. 120). None of these authors clarified the difference in variables between equations
(4) and (5). In this study, distinguishing X from (X+k) shows that both
presentations agree within the same stochastic process.

Results in equation (5) can be interpreted in the context of test misgrading. For a
given test, the smaller the chance of misgrading (p), the longer the waiting time before a break
session is needed. In an extreme case, multiple-choice tests are graded by a machine
for which p equals zero in each trial. According to equation (5), the waiting time is then
infinite, and thus there is no need for a rest unless the grading machine breaks
down.
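
As a rough illustration of this interpretation, the sketch below tabulates the waiting time 1/p for a few hypothetical misgrading rates and treats the machine-graded case p = 0 as an infinite waiting time. The function name and the listed rates are assumptions made for the example.

```python
import math


def expected_trials_until_first_misgrading(p):
    """Waiting time E(X + 1) = 1/p; infinite when misgrading cannot occur."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    return math.inf if p == 0 else 1 / p


# Hypothetical misgrading rates, ending with an error-free grading machine
for p in (0.10, 0.02, 0.005, 0.0):
    waiting_time = expected_trials_until_first_misgrading(p)
    print(f"p = {p:<6} expected trials until first misgrading: {waiting_time}")
```

Under these assumptions, a break would be scheduled somewhat before the expected waiting time has elapsed, and never for the error-free machine.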

A Negative Binomial Model 

In a more complicated situation, a test is graded by a total of n graders.
For any grader i, the threshold $k_i$ depends on the quality control requirement, and may
take a value larger than one. For the $k_i$th misgrading event to occur on the
$(k_i + X_i)$th trial, the previous $(k_i - 1)$ misgrading events must already have happened in the
preceding $(k_i + X_i - 1)$ trials. The probability of obtaining $(k_i - 1)$ misgrading cases in the
first $(k_i + X_i - 1)$ trials, followed by a misgrading on trial $(k_i + X_i)$, is given by

\[
f(x_i) = \binom{k_i + x_i - 1}{k_i - 1}\, p^{k_i} (1-p)^{x_i}; \qquad x_i = 0, 1, 2, \ldots \tag{6}
\]

where p is the probability of misgrading on each trial, including trial $(k_i + X_i)$.
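
A minimal sketch of this probability function follows, assuming the standard negative binomial form C(k+x-1, k-1) p^k (1-p)^x for x correct gradings before the kth misgrading. The function name and the parameter values are hypothetical. The check at the end confirms that k = 1 reduces to the geometric case and that the probabilities sum to one.

```python
from math import comb


def negbin_pmf(x, k, p):
    """Probability of exactly x correct gradings before the k-th misgrading."""
    if k < 1 or x < 0:
        raise ValueError("k must be >= 1 and x must be >= 0")
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    # Choose which of the first k + x - 1 trials hold the first k - 1 misgradings
    return comb(k + x - 1, k - 1) * p**k * (1 - p) ** x


k, p = 3, 0.05  # hypothetical threshold and misgrading rate

# Partial sum of the probabilities; the neglected tail is negligible here
total = sum(negbin_pmf(x, k, p) for x in range(5000))
print(f"sum of probabilities: {total:.12f}")

# With k = 1 the formula collapses to the geometric form (1 - p)**x * p
print(negbin_pmf(4, 1, p), (1 - p) ** 4 * p)
```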

Equation (6) follows a negative binomial distribution (Feller, 1957). The
probability generating function for the negative binomial distribution is
$p^{k_i} [1 - (1-p)t]^{-k_i}$. Because the overall quality control is based on the cumulative
performance of the n graders, the stochastic process involves n independent random
variables, $X_1, \ldots, X_n$. Thus, the quality control threshold k hinges on the distribution of
$\sum_{i=1}^{n} X_i$. Fortunately, the probability generating function for $\sum_{i=1}^{n} X_i$ follows

\[
P_{\sum_{i=1}^{n} X_i}(t) = \prod_{i=1}^{n} p^{k_i} \left[1 - (1-p)t\right]^{-k_i} = p^{k} \left[1 - (1-p)t\right]^{-k} \tag{7}
\]

where $k = k_1 + \cdots + k_n$. Based on the uniqueness of the probability generating function (Port,
1994), $\sum_{i=1}^{n} X_i$ must have a negative binomial distribution with parameters k and p.
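
The additivity of the thresholds can also be checked without generating functions. The sketch below convolves the probability functions of two hypothetical graders, with thresholds k1 = 2 and k2 = 3 and a common p = 0.1, and compares the result with a single negative binomial distribution with k = k1 + k2. All names and values are assumptions made for illustration.

```python
from math import comb


def negbin_pmf(x, k, p):
    """Probability of exactly x correct gradings before the k-th misgrading."""
    return comb(k + x - 1, k - 1) * p**k * (1 - p) ** x


def convolve(f, g, n):
    """Probabilities that two independent counts add up to s, for s = 0..n-1."""
    return [sum(f[j] * g[s - j] for j in range(s + 1)) for s in range(n)]


p, k1, k2, n = 0.1, 2, 3, 60  # hypothetical rate, thresholds, and table length

grader_1 = [negbin_pmf(x, k1, p) for x in range(n)]
grader_2 = [negbin_pmf(x, k2, p) for x in range(n)]

summed = convolve(grader_1, grader_2, n)                 # distribution of X_1 + X_2
direct = [negbin_pmf(x, k1 + k2, p) for x in range(n)]   # negative binomial, k = 5

max_gap = max(abs(a - b) for a, b in zip(summed, direct))
print(f"largest difference between the two tables: {max_gap:.2e}")
```

The two tables should agree up to floating-point rounding, which matches the conclusion drawn from the uniqueness of the probability generating function.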



The expected value of $\sum_{i=1}^{n} X_i$ can be derived from the negative binomial
distribution, and the result is k(1-p)/p (see Casella & Berger, 1990). Therefore, the
waiting time before occurrence of the kth misgrading is:

\[
E\left[\left(\sum_{i=1}^{n} X_i\right) + k\right] = \frac{k(1-p)}{p} + k = \frac{k}{p} \tag{8}
\]



In a comparison between (5) and (8), one may note that the geometric process can be
treated as a special case (k=1) of the negative binomial distribution. Based on the
results in (8), the waiting time for a break period is longer when the overall tolerance
level k is higher or the chance of misgrading (p) is smaller.
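
A simple Monte Carlo sketch can illustrate the waiting time k/p. It simulates independent grading trials until the kth misgrading and averages the number of trials needed. The threshold k = 3, the rate p = 0.05, the number of runs, and the seed are all hypothetical choices, and the simulated mean is only expected to be close to k/p, not equal to it.

```python
import random


def trials_until_kth_misgrading(k, p, rng):
    """Simulate grading trials; return the trial on which the k-th misgrading occurs."""
    trials = 0
    errors = 0
    while errors < k:
        trials += 1
        if rng.random() < p:  # this trial is misgraded with probability p
            errors += 1
    return trials


def simulated_mean_waiting_time(k, p, runs=20_000, seed=0):
    """Average trial count over many simulated grading sessions."""
    rng = random.Random(seed)
    return sum(trials_until_kth_misgrading(k, p, rng) for _ in range(runs)) / runs


k, p = 3, 0.05  # hypothetical tolerance level and misgrading rate
simulated = simulated_mean_waiting_time(k, p)
print(f"simulated mean waiting time: {simulated:.2f}")
print(f"closed form k / p:           {k / p:.2f}")
```

Setting k = 1 in the same simulation gives the geometric case, so one routine covers both the single-grader and the multi-grader setting.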

In summary, partly because of differences in notation, the well-established
geometric and negative binomial distributions have yet to be used in models of test 
misgrading. In other fields, Johnson and Kotz (1969) have noted that “the negative 
binomial distribution is frequently used as a substitute for the Poisson distribution when it 
is doubtful whether the strict requirements, particularly independence, for a Poisson 
distribution will be satisfied” (p. 135). Thus, the geometric and negative binomial 
models provide alternative choices that are more flexible than the Poisson model in 
educational and psychological measurements. 

Given the connection between geometric and negative binomial distributions,
applications of these stochastic models hinge on characteristics of a specific setting.
The geometric process is developed from a single-grader scenario under a policy of zero
tolerance for test misgrading. Thus, the result in equation (5) may be more applicable in
a local setting in which a teacher has been assigned to grade tests for an entire class. The
negative binomial process, on the other hand, seems appropriate for state or national
assessments that involve more than one test grader. In both cases, the waiting time for
test misgrading has been derived from the corresponding stochastic process. The
results can be employed to schedule break periods that keep the number of misgradings
below a threshold k.